AITopics | word class


word class


A Computational Approach to Analyzing Language Change and Variation in the Constructed Language Toki Pona

Huang, Daniel, Joo, Hyoun-A

arXiv.org Artificial Intelligence, Aug-15-2025

This study explores language change and variation in Toki Pona, a constructed language with approximately 120 core words. Taking a computational and corpus-based approach, the study analyzes features including fluid word classes and transitivity in order to examine (1) changes in preferences of content words for different syntactic positions over time and (2) variation in usage across different corpora. The results suggest that sociolinguistic factors influence Toki Pona in the same way as natural languages, and that even constructed linguistic systems naturally evolve as communities use them.

artificial intelligence, natural language, toki pona, (11 more...)

arXiv.org Artificial Intelligence

2508.10246

Country: Europe (0.14)

Genre:

Research Report > New Finding (0.49)
Research Report > Experimental Study (0.35)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)


Towards few-shot isolated word reading assessment

Smit, Reuben, Louw, Retief, Kamper, Herman

arXiv.org Artificial Intelligence, Jul-17-2025

We explore an ASR-free method for isolated word reading assessment in low-resource settings. Our few-shot approach compares input child speech to a small set of adult-provided reference templates. Inputs and templates are encoded using intermediate layers from large self-supervised learned (SSL) models. Using an Afrikaans child speech benchmark, we investigate design options such as discretising SSL features and barycentre averaging of the templates. Idealised experiments show reasonable performance for adults, but a substantial drop for child speech input, even with child templates. Despite the success of employing SSL representations in low-resource speech tasks, our work highlights the limitations of SSL representations for processing child data when used in a few-shot classification system.

artificial intelligence, machine learning, template, (16 more...)

arXiv.org Artificial Intelligence

2507.12217

Country: Africa (0.14)

Genre: Research Report > New Finding (0.94)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
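
The template comparison described in the few-shot reading assessment abstract above can be illustrated with a minimal sketch. The abstract does not say how inputs and templates are aligned; dynamic time warping (DTW) is assumed here because it is a common choice for comparing variable-length feature sequences. The function names, the toy 2-dimensional "features", and the word labels are all hypothetical, standing in for real SSL-layer features.

```python
import math

def dtw_distance(a, b):
    """Accumulated DTW cost between two sequences of feature vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def classify(query, templates):
    """Return the label whose closest template has the lowest DTW cost."""
    return min(
        templates,
        key=lambda label: min(dtw_distance(query, t) for t in templates[label]),
    )

# Toy data (hypothetical): one reference template per word.
TEMPLATES = {
    "kat": [[[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]],
    "hond": [[[5.0, 5.0], [4.0, 4.0], [3.0, 3.0]]],
}
QUERY = [[0.0, 0.0], [0.9, 1.1], [1.0, 1.0], [2.1, 2.0]]
PREDICTED = classify(QUERY, TEMPLATES)
```

In a reading assessment setting, one would presumably compare the cost against the template of the prompted word and accept or reject the attempt with a threshold; that decision rule is also an assumption, not something the abstract specifies.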


Extending dependencies to the taggedPBC: Word order in transitive clauses

Ring, Hiram

arXiv.org Artificial Intelligence, Jun-10-2025

The taggedPBC (Ring 2025a) contains more than 1,800 sentences of POS-tagged parallel text data from over 1,500 languages, representing 133 language families and 111 isolates. While this dwarfs previously available resources, and the POS tags achieve decent accuracy, allowing for predictive crosslinguistic insights (Ring 2025b), the dataset was not initially annotated for dependencies. This paper reports on a CoNLL-U-formatted version of the dataset which transfers dependency information along with POS tags to all languages in the taggedPBC. Although there are various concerns regarding the quality of the tags and the dependencies, word order information derived from this dataset regarding the position of arguments and predicates in transitive clauses correlates with expert determinations of word order in three typological databases (WALS, Grambank, Autotyp). This highlights the usefulness of corpus-based typological approaches (as per Baylor et al. 2023; Bjerva 2024) for extending comparisons of discrete linguistic categories, and suggests that important insights can be gained even from noisy data, given sufficient annotation. The dependency-annotated corpora are also made available for research and collaboration via GitHub.

artificial intelligence, dependency, natural language, (17 more...)

arXiv.org Artificial Intelligence

2506.06785

Country:

Asia (0.46)
Europe (0.46)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.88)
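
To make the word-order derivation in the taggedPBC abstract above more concrete, here is a minimal sketch of reading subject, verb, and object order from one CoNLL-U sentence. It assumes Universal Dependencies labels (`VERB`, `nsubj`, `obj`); the function name and the sample sentence are hypothetical, and this is not the paper's actual pipeline.

```python
def clause_order(conllu_sentence):
    """Return e.g. 'SVO' for the first verb with nsubj and obj dependents, else None."""
    rows = []
    for line in conllu_sentence.strip().splitlines():
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Keep plain word lines only (skips multiword ranges and empty nodes).
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        # CoNLL-U columns used: ID (0), UPOS (3), HEAD (6), DEPREL (7)
        rows.append((int(cols[0]), cols[3], int(cols[6]), cols[7]))
    for tok_id, upos, _head, _deprel in rows:
        if upos != "VERB":
            continue
        subj = [i for i, _, h, d in rows if h == tok_id and d == "nsubj"]
        obj = [i for i, _, h, d in rows if h == tok_id and d == "obj"]
        if subj and obj:
            positions = {"S": subj[0], "V": tok_id, "O": obj[0]}
            # Order the three letters by token position in the sentence.
            return "".join(sorted(positions, key=positions.get))
    return None

def make_sentence(tokens):
    """Build CoNLL-U lines from (id, form, upos, head, deprel) tuples."""
    return "\n".join(
        "\t".join([str(i), form, "_", upos, "_", "_", str(head), deprel, "_", "_"])
        for i, form, upos, head, deprel in tokens
    )

# Hypothetical English example: "dogs chase cats"
SVO_SENTENCE = make_sentence([
    (1, "dogs", "NOUN", 2, "nsubj"),
    (2, "chase", "VERB", 0, "root"),
    (3, "cats", "NOUN", 2, "obj"),
])
ORDER = clause_order(SVO_SENTENCE)
```

Counting the returned labels over all transitive clauses of a language would give a distribution that could then be compared with a typological database entry; the aggregation step is left out here.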


A Grounded Typology of Word Classes

Haley, Coleman, Goldwater, Sharon, Ponti, Edoardo

arXiv.org Artificial Intelligence, Dec-13-2024

We propose a grounded approach to meaning in language typology. We treat data from perceptual modalities, such as images, as a language-agnostic representation of meaning. Hence, we can quantify the function–form relationship between images and captions across languages. Inspired by information theory, we define "groundedness", an empirical measure of contextual semantic contentfulness (formulated as a difference in surprisal) which can be computed with multilingual multimodal language models. As a proof of concept, we apply this measure to the typology of word classes. Our measure captures the contentfulness asymmetry between functional (grammatical) and lexical (content) classes across languages, but contradicts the view that functional classes do not convey content. Moreover, we find universal trends in the hierarchy of groundedness (e.g., nouns > adjectives > verbs), and show that our measure partly correlates with psycholinguistic concreteness norms in English. We release a dataset of groundedness scores for 30 languages. Our results suggest that the grounded typology approach can provide quantitative evidence about semantic function in language.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2412.10369

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
(17 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)
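
The "difference in surprisal" mentioned in the groundedness abstract above can be illustrated with a small sketch. It assumes the measure is the surprisal of a token under a text-only model minus its surprisal under an image-conditioned model; this sign convention, the function names, and the probabilities are illustrative assumptions, not values or definitions taken from the paper.

```python
import math

def surprisal(probability):
    """Surprisal in bits of an event with the given probability."""
    return -math.log2(probability)

def groundedness(p_text_only, p_with_image):
    """Bits of surprisal removed by conditioning on the image (assumed definition)."""
    return surprisal(p_text_only) - surprisal(p_with_image)

# Made-up probabilities for one caption token in context.
NOUN_SCORE = groundedness(p_text_only=0.01, p_with_image=0.16)  # image helps a lot
FUNCTION_WORD_SCORE = groundedness(p_text_only=0.50, p_with_image=0.50)  # image adds nothing
```

Under this reading, a token that the image makes much more predictable receives a high score, while a token that is equally predictable with or without the image scores zero.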


Interview with Leanne Nortje: Visually-grounded few-shot word learning

AIHub, Jul-5-2023, 09:26:55 GMT

In their work Visually grounded few-shot word learning in low-resource settings, Leanne Nortje, Dan Oneata and Herman Kamper propose a visually-grounded speech model that learns new words and their visual depictions. In this interview, Leanne tells us more about their methodology and how it could be beneficial for low-resource languages. We look into using vision as a form of weakly transcribing audio. This will be particularly helpful for low-resource languages where, in extreme cases, such languages have no written form. We specifically consider the task of retrieving relevant images for a given spoken word by learning from only a few image-word pairs, i.e. to do multimodal few-shot word learning.

leanne nortje, low-resource language, visually-grounded few-shot word, (6 more...)

AIHub

Technology: Information Technology > Artificial Intelligence (0.35)


Leveraging multilingual transfer for unsupervised semantic acoustic word embeddings

Jacobs, Christiaan, Kamper, Herman

arXiv.org Artificial Intelligence, Jul-5-2023

Acoustic word embeddings (AWEs) are fixed-dimensional vector representations of speech segments that encode phonetic content so that different realisations of the same word have similar embeddings. In this paper we explore semantic AWE modelling. These AWEs should not only capture phonetics but also the meaning of a word (similar to textual word embeddings). We consider the scenario where we only have untranscribed speech in a target language. We introduce a number of strategies leveraging a pre-trained multilingual AWE model (a phonetic AWE model trained on labelled data from multiple languages excluding the target). Our best semantic AWE approach involves clustering word segments using the multilingual AWE model, deriving soft pseudo-word labels from the cluster centroids, and then training a Skipgram-like model on the soft vectors. In an intrinsic word similarity task measuring semantics, this multilingual transfer approach outperforms all previous semantic AWE methods. We also show, for the first time, that AWEs can be used for downstream semantic query-by-example search.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2307.02083

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Africa > South Africa (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
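
One step of the semantic AWE approach described in the abstract above, deriving soft pseudo-word labels from cluster centroids, can be sketched as follows. The abstract does not give the exact formula; a softmax over negative Euclidean distances is assumed here purely for illustration, and the function name, the temperature parameter, and the toy vectors are hypothetical.

```python
import math

def soft_labels(embedding, centroids, temperature=1.0):
    """Soft cluster assignment: softmax over negative distances to each centroid."""
    scores = [-math.dist(embedding, c) / temperature for c in centroids]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 2-dimensional centroids standing in for clustered multilingual AWEs.
CENTROIDS = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
SEGMENT = [1.0, 1.0]
LABELS = soft_labels(SEGMENT, CENTROIDS)
```

The resulting probability vector would take the place of a one-hot word identity when training a Skipgram-like model; the training step itself is not shown.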


UzbekTagger: The rule-based POS tagger for Uzbek language

Sharipov, Maksud, Kuriyozov, Elmurod, Yuldashev, Ollabergan, Sobirov, Ogabek

arXiv.org Artificial Intelligence, Mar-1-2023

This research paper presents a part-of-speech (POS) annotated dataset and tagger tool for the low-resource Uzbek language. The dataset includes 12 tags, which were used to develop a rule-based POS-tagger tool. The corpus text used in the annotation process was balanced over 20 different fields in order to ensure its representativeness. Uzbek is an agglutinative language, so most of the words in an Uzbek sentence are formed by adding suffixes. This makes POS tagging difficult, because both the stem of each word and the part of speech it belongs to must be identified. The proposed methodology stems words with an affix/suffix stripping approach that draws on a database of the stem forms of words in the Uzbek language. The tagger tool was tested on the annotated dataset and showed high accuracy in identifying and tagging parts of speech in Uzbek text. This newly presented dataset and tagger tool can be used for a variety of natural language processing tasks such as language modeling, machine translation, and text-to-speech synthesis. The presented dataset is the first of its kind to be made publicly available for Uzbek, and the POS-tagger tool can also serve as a base for other closely related Turkic languages.

artificial intelligence, natural language, text processing, (17 more...)

arXiv.org Artificial Intelligence

2301.12711

Country:

Asia > Uzbekistan (0.05)
Europe > Spain (0.04)
Asia > India (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)


Word class representations spontaneously emerge in a deep neural network trained on next word prediction

Surendra, Kishore, Schilling, Achim, Stoewer, Paul, Maier, Andreas, Krauss, Patrick

arXiv.org Artificial Intelligence, Feb-15-2023

How do humans learn language, and can the first language be learned at all? These fundamental questions are still hotly debated. In contemporary linguistics, there are two major schools of thought that give completely opposite answers. According to Chomsky's theory of universal grammar, language cannot be learned because children are not exposed to sufficient data in their linguistic environment. In contrast, usage-based models of language assume a profound relationship between language structure and language use. In particular, contextual mental processing and mental representations are assumed to have the cognitive capacity to capture the complexity of actual language use at all levels. The prime example is syntax, i.e., the rules by which words are assembled into larger units such as sentences. Typically, syntactic rules are expressed as sequences of word classes. However, it remains unclear whether word classes are innate, as implied by universal grammar, or whether they emerge during language acquisition, as suggested by usage-based approaches. Here, we address this issue from a machine learning and natural language processing perspective. In particular, we trained an artificial deep neural network to predict the next word, given sequences of consecutive words as input. Subsequently, we analyzed the emerging activation patterns in the hidden layers of the neural network. Strikingly, we find that the internal representations of nine-word input sequences cluster according to the word class of the tenth word to be predicted as output, even though the neural network did not receive any explicit information about syntactic rules or word classes during training. This surprising result suggests that, in the human brain as well, abstract representational categories such as word classes may naturally emerge as a consequence of predictive coding and processing during language acquisition.

artificial intelligence, machine learning, neural network, (16 more...)

arXiv.org Artificial Intelligence

2302.07588

Country:

Europe > Germany > Bavaria > Middle Franconia > Nuremberg (0.05)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > California > Los Angeles County > Pasadena (0.04)
(2 more...)

Genre: Research Report > New Finding (0.49)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)


Development of a rule-based lemmatization algorithm through Finite State Machine for Uzbek language

Sharipov, Maksud, Sobirov, Ogabek

arXiv.org Artificial Intelligence, Oct-28-2022

Lemmatization is one of the core concepts in natural language processing, so creating a lemmatization tool is an important task. This paper discusses the construction of a lemmatization algorithm for the Uzbek language. The main purpose of the work is to remove affixes from Uzbek words by means of a finite state machine and to identify the lemma (the form of the word that can be found in the dictionary). The process of removing affixes uses a database of affixes and part-of-speech knowledge. The lemmatization covers the general rules and part-of-speech data of the Uzbek language, the affixes and their classification, the removal of affixes on the basis of a finite state machine for each class, and the identification of the word's lemma.

artificial intelligence, natural language, uzbek language, (13 more...)

arXiv.org Artificial Intelligence

2210.16006

Country:

Asia > Uzbekistan > Toshkent Shahri > Tashkent (0.06)
Europe > Slovenia > Coastal-Karst > Municipality of Koper > Koper (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.40)


UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language

Sharipov, Maksud, Yuldashov, Ollabergan

arXiv.org Artificial Intelligence, Oct-28-2022

In this paper we present a rule-based stemming algorithm for the Uzbek language. Uzbek is an agglutinative language, so many words are formed by adding suffixes, and the number of suffixes is also large. For this reason, it is difficult to find the stem of a word. The proposed methodology stems Uzbek words with an affix stripping approach that does not rely on any database of the normal word forms of the Uzbek language. Word affixes are classified into fifteen classes, and each class is designed as a finite state machine (FSM) according to morphological rules. We created fifteen FSMs and linked them together to create the Basic FSM. A lexicon of affixes in XML format was created and a stemming application for Uzbek words has been developed based on the FSMs.

artificial intelligence, natural language, suffix, (18 more...)

arXiv.org Artificial Intelligence

2210.16011

Country:

Asia > Uzbekistan > Toshkent Shahri > Tashkent (0.05)
Europe > Switzerland (0.04)
Europe > Slovenia > Coastal-Karst > Municipality of Koper > Koper (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.71)
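
The affix stripping idea in the UzbekStemmer abstract above can be illustrated with a deliberately simplified sketch. It replaces the paper's fifteen linked FSMs with a single loop that repeatedly removes the longest matching suffix; the six-entry suffix list (plural and case endings), the minimum stem length, and the function name are illustrative assumptions and do not reproduce the paper's XML lexicon or its rules.

```python
# A tiny, hypothetical suffix inventory: plural -lar and a few case endings.
# Sorted longest first so that "dan" is tried before "da", "ning" before "ni".
SUFFIXES = sorted(["lar", "ning", "ni", "ga", "dan", "da"], key=len, reverse=True)

def strip_suffixes(word, min_stem=3):
    """Repeatedly strip the longest known suffix while the stem stays long enough."""
    stem = word
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem:
                stem = stem[: -len(suffix)]
                changed = True
                break  # restart the scan on the shortened stem
    return stem

# "kitob" (book) with plural and accusative endings; "maktab" (school) in the locative.
STEM_A = strip_suffixes("kitoblarni")
STEM_B = strip_suffixes("maktabda")
```

A flat suffix list like this over-strips words whose stems happen to end in a suffix string, which is one reason a class-ordered FSM design, as described in the abstract, is more robust.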
